modeltime tune

Machine learning time series tidymodels Dacon R

modeltime 튜닝 방법 소개

dondon true
2022-03-02

데이터는 데이콘에서 제공하는 발전량 데이터 일부를 뽑아서 활용했다.

Preparations (준비작업)

Libraries

Data load

rdata <- read.csv("/Users/sangdon/Desktop/distill_blog/_posts/2022-03-02-modeltime-tune/rdata2.csv", fileEncoding = "CP949", encoding = "UTF-8")

Data overview (데이터 기본정보)

Data

head(rdata)
                 time 기온 강수 풍속 풍향 습도 증기압 이슬점   기압
1 2015-01-01 01:00:00 -4.4 0.00  5.4  340   47    2.1  -14.0 1020.3
2 2015-01-01 02:00:00 -4.6 0.00  4.9  340   50    2.2  -13.4 1020.3
3 2015-01-01 03:00:00 -4.7 0.05  6.2  320   50    2.2  -13.5 1020.7
4 2015-01-01 04:00:00 -5.0 0.00  5.0  320   56    2.4  -12.4 1020.6
5 2015-01-01 05:00:00 -5.0 0.00  5.5  320   52    2.2  -13.3 1020.4
6 2015-01-01 06:00:00 -5.3 0.00  4.0  340   58    2.4  -12.2 1020.7
  해면기압 일조 일사 전운량 시정 지면온도 floating warehouse dangjin
1   1024.0   NA    0      6 1500     -4.4       NA         0       0
2   1024.0   NA    0     NA 1750     -4.6       NA         0       0
3   1024.5   NA    0      6 2000     -4.7       NA         0       0
4   1024.4   NA    0      6 2000     -5.0       NA         0       0
5   1024.2   NA    0      6 2000     -5.0       NA         0       0
6   1024.5   NA    0      6 2000     -5.3       NA         0       0
  ulsan hour
1     0    1
2     0    2
3     0    3
4     0    4
5     0    5
6     0    6
glimpse(rdata)
Rows: 55,392
Columns: 20
$ time      <chr> "2015-01-01 01:00:00", "2015-01-01 02:00:00", "201…
$ 기온      <dbl> -4.4, -4.6, -4.7, -5.0, -5.0, -5.3, -5.7, -5.7, -5.4…
$ 강수      <dbl> 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
$ 풍속      <dbl> 5.4, 4.9, 6.2, 5.0, 5.5, 4.0, 5.0, 4.6, 6.7, 6.5, 6.…
$ 풍향      <dbl> 340, 340, 320, 320, 320, 340, 340, 340, 320, 320, 32…
$ 습도      <dbl> 47, 50, 50, 56, 52, 58, 58, 56, 57, 59, 60, 62, 58, …
$ 증기압    <dbl> 2.1, 2.2, 2.2, 2.4, 2.2, 2.4, 2.3, 2.3, 2.3, 2.5, 2.6…
$ 이슬점    <dbl> -14.0, -13.4, -13.5, -12.4, -13.3, -12.2, -12.6, -13.…
$ 기압      <dbl> 1020.3, 1020.3, 1020.7, 1020.6, 1020.4, 1020.7, 1021…
$ 해면기압  <dbl> 1024.0, 1024.0, 1024.5, 1024.4, 1024.2, 1024.5, 1025.1…
$ 일조      <dbl> NA, NA, NA, NA, NA, NA, NA, 0.0, 0.0, 0.6, 0.9, 0.5,…
$ 일사      <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.0…
$ 전운량    <int> 6, NA, 6, 6, 6, 6, 6, 3, 6, 6, 7, 8, 8, 8, 8, 6, 6, 8…
$ 시정      <dbl> 1500.000, 1750.000, 2000.000, 2000.000, 2000.000, 20…
$ 지면온도  <dbl> -4.4, -4.6, -4.7, -5.0, -5.0, -5.3, -5.7, -5.7, -5.4, …
$ floating  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ warehouse <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 12, 150, 195, 270, 254, 18…
$ dangjin   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 8, 227, 263, 356, 333, 225…
$ ulsan     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ hour      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
skim(rdata)
Table 1: Data summary
Name rdata
Number of rows 55392
Number of columns 20
_______________________
Column type frequency:
character 1
numeric 19
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
time 0 1 19 19 0 55392 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
기온 0 1.00 12.21 10.36 -19.3 3.60 12.60 21.10 36.80 ▁▅▇▇▂
강수 0 1.00 0.12 1.05 0.0 0.00 0.00 0.00 102.70 ▇▁▁▁▁
풍속 0 1.00 1.96 1.56 0.0 0.70 1.60 2.90 11.70 ▇▃▁▁▁
풍향 0 1.00 164.29 129.50 0.0 20.00 180.00 290.00 360.00 ▇▁▃▂▆
습도 0 1.00 75.34 20.14 10.0 61.00 79.00 94.00 100.00 ▁▂▃▅▇
증기압 82 1.00 12.93 8.92 1.0 5.40 10.20 19.20 42.90 ▇▅▃▂▁
이슬점 85 1.00 7.37 11.04 -22.4 -1.70 7.30 16.90 30.20 ▁▆▇▇▅
기압 79 1.00 1014.05 8.36 983.6 1007.40 1014.30 1020.60 1036.30 ▁▃▇▇▂
해면기압 78 1.00 1017.37 8.48 986.6 1010.60 1017.60 1024.00 1039.70 ▁▃▇▇▂
일조 25453 0.54 0.52 0.45 0.0 0.00 0.60 1.00 1.00 ▇▁▁▁▇
일사 0 1.00 0.54 0.83 0.0 0.00 0.01 0.89 4.85 ▇▂▁▁▁
전운량 12427 0.78 5.23 3.84 0.0 1.00 6.00 9.00 10.00 ▇▂▃▅▇
시정 0 1.00 1754.20 1024.17 3.0 1100.00 1800.00 2029.00 6454.00 ▅▇▁▁▁
지면온도 0 1.00 12.21 10.36 -19.3 3.60 12.60 21.10 36.80 ▁▅▇▇▂
floating 28344 0.49 122.42 192.34 0.0 0.00 0.00 192.25 753.00 ▇▁▁▁▁
warehouse 2040 0.96 95.81 150.72 0.0 0.00 0.00 152.00 595.00 ▇▁▁▁▁
dangjin 2040 0.96 140.11 221.67 0.0 0.00 0.00 227.00 881.00 ▇▁▁▁▁
ulsan 3528 0.94 66.28 104.17 0.0 0.00 0.00 104.00 448.00 ▇▁▁▁▁
hour 0 1.00 11.50 6.92 0.0 5.75 11.50 17.25 23.00 ▇▇▆▇▇

Data preprocessing

rdata <- rdata %>% 
    select(-hour) %>% 
    mutate(time = ymd_hms(time)) %>% 
    filter(between(time, ymd('2018-03-01'), ymd('2021-01-31'))) 

rdata %>% glimpse()  
Rows: 25,609
Columns: 19
$ time      <dttm> 2018-03-01 00:00:00, 2018-03-01 01:00:00, 2018-03…
$ 기온      <dbl> 3.1, 2.8, 2.6, 2.0, 2.2, 4.1, 3.5, 2.2, 1.0, 0.3, 0.…
$ 강수      <dbl> 0.50, 0.00, 0.00, 0.00, 0.00, 0.00, 1.80, 0.00, 0.00…
$ 풍속      <dbl> 3.6, 0.7, 3.2, 1.9, 2.1, 4.4, 7.9, 6.4, 7.7, 8.9, 7.…
$ 풍향      <dbl> 340, 140, 320, 230, 180, 270, 320, 290, 320, 320, 32…
$ 습도      <dbl> 96, 97, 95, 97, 97, 97, 93, 86, 82, 71, 63, 58, 60, …
$ 증기압    <dbl> 7.3, 7.2, 7.0, 6.8, 6.9, 7.9, 7.3, 6.1, 5.4, 4.4, 4.0…
$ 이슬점    <dbl> 2.5, 2.3, 1.8, 1.5, 1.7, 3.6, 2.4, 0.0, -1.7, -4.3, -…
$ 기압      <dbl> 1001.3, 1001.9, 1002.6, 1002.8, 1003.0, 1001.8, 1002…
$ 해면기압  <dbl> 1004.9, 1005.5, 1006.2, 1006.4, 1006.6, 1005.4, 1006.3…
$ 일조      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0.0, 0.8, 1.0, 1.0, …
$ 일사      <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05…
$ 전운량    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ 시정      <dbl> 922, 4315, 2601, 1717, 1957, 571, 57, 436, 593, 593,…
$ 지면온도  <dbl> 3.1, 2.8, 2.6, 2.0, 2.2, 4.1, 3.5, 2.2, 1.0, 0.3, 0.6,…
$ floating  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 36, 313, 532, 607, 614,…
$ warehouse <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 33, 209, 296, 315, 474,…
$ dangjin   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 37, 318, 490, 550, 727,…
$ ulsan     <int> 0, 0, 0, 0, 0, 0, 0, 0, 4, 35, 71, 82, 334, 372, 3…
rdata %>% 
  summarise(across(.fns = ~sum(is.na(.))/length(.)))
  time 기온 강수 풍속 풍향 습도      증기압     이슬점        기압
1    0    0    0    0    0    0 0.001835292 0.00191339 0.001718146
     해면기압      일조 일사    전운량 시정 지면온도 floating
1 0.001679097 0.4545668    0 0.1552579    0        0        0
  warehouse dangjin ulsan
1         0       0     0

Univariate timeseries analysis

울산 지역의 전력 발전량 데이터만 활용할 것이기 때문에 날짜 변수를 제외한 나머지 변수는 제거했다.

ulsan <- rdata %>% 
  select(-c(dangjin, warehouse, floating)) %>% 
  select(time, ulsan) %>% 
  rename(date = time, value = ulsan)

Time series visualization

tidymodels의 경우 train/test 분리를 위해서 initial_split()을 활용했는데 시계열 데이터의 경우 특정 날짜를 기준으로 잘라야하기 때문에 Modeltime 패키지에 내장되어있는 initial_time_split()을 이용한다. 특정 날짜를 기준으로 자르고 싶을 경우 timeseries_split()을 이용할 수도 있다.

initial_time_split(data = ulsan, prop = 0.9) %>% 
  tk_time_series_cv_plan() %>% 
  plot_time_series_cv_plan(date, value,
                           .interact = TRUE, 
                           .title = "Partition Train / Test")

Split train/test

months <- 1

total_months <- lubridate::interval(base::min(ulsan$date),
                                    base::max(ulsan$date)) %/%  
                                    base::months(1)


prop <- (total_months - months) / total_months

splits <- rsample::initial_time_split(ulsan, prop = prop)


splits %>%
  timetk::tk_time_series_cv_plan() %>%  
  timetk::plot_time_series_cv_plan(date, value) 
resamples_tscv_lag <- time_series_cv(
    data = training(splits) %>% drop_na(),
    cumulative  = TRUE,
    initial     = "6 months",
    assess      = "8 weeks",
    skip        = "4 weeks",
    slice_limit = 6
)

resamples_tscv_lag %>%
    tk_time_series_cv_plan() %>%
    plot_time_series_cv_plan(date, value)
model_spec_nnetar <- nnetar_reg(
    seasonal_period = 7,
    non_seasonal_ar = tune(id = "non_seasonal_ar"),
    seasonal_ar     = tune(),
    hidden_units    = tune(),
    num_networks    = 10,
    penalty         = tune(),
    epochs          = 50
) %>%
    set_engine("nnetar")
parameters(model_spec_nnetar)
Collection of 4 parameters for tuning

      identifier            type    object
 non_seasonal_ar non_seasonal_ar nparam[+]
     seasonal_ar     seasonal_ar nparam[+]
    hidden_units    hidden_units nparam[+]
         penalty         penalty nparam[+]
set.seed(123)
grid_random(parameters(model_spec_nnetar), size = 10)
# A tibble: 10 × 4
   non_seasonal_ar seasonal_ar hidden_units       penalty
             <int>       <int>        <int>         <dbl>
 1               2           1            3 0.000000152  
 2               5           0            8 0.0000000207 
 3               2           1           10 0.00000000268
 4               1           2            7 0.00000140   
 5               1           0           10 0.00000137   
 6               5           2            9 0.000000488  
 7               2           2            3 0.00000000335
 8               4           0            4 0.00000000244
 9               3           0            1 0.0000000214 
10               5           0            7 0.00000457   
modeltime::non_seasonal_ar()
Non-seasonal AR Term (quantitative)
Range: [0, 5]
Seasonal AR Term (quantitative)
Range: [0, 2]
hidden_units()
# Hidden Units (quantitative)
Range: [1, 10]
penalty() %>% dials::value_sample(5)
[1] 4.568003e-08 3.786842e-02 2.872842e-10 2.642413e-06 9.755476e-03
set.seed(123)
grid_spec_nnetar_1 <- grid_latin_hypercube(
    parameters(model_spec_nnetar),
    size = 15
)
rec <- recipe(value ~ ., training(splits)) 
  #step_date(date, features = c("dow", "month", "year")) %>% 
  #step_mutate(hour = hour(date), day = day(date), month = month(date)) %>% 
  #step_dummy(all_nominal_predictors()) %>% 
  
rec %>% prep() %>% juice() %>% head()
# A tibble: 6 × 2
  date                value
  <dttm>              <int>
1 2018-03-01 00:00:00     0
2 2018-03-01 01:00:00     0
3 2018-03-01 02:00:00     0
4 2018-03-01 03:00:00     0
5 2018-03-01 04:00:00     0
6 2018-03-01 05:00:00     0
wflw_fit_nnetar <- workflow() %>% 
  add_recipe(rec) %>%
  add_model(model_spec_nnetar)
library(doFuture)

registerDoFuture()

n_cores <- parallel::detectCores()

plan(
    strategy = cluster,
    workers  = parallel::makeCluster(n_cores)
)


library(tictoc)

tic()
set.seed(123)
tune_results_nnetar_1 <- wflw_fit_nnetar %>%
    tune_grid(
        resamples = resamples_tscv_lag,
        grid      = grid_spec_nnetar_1,
        metrics   = default_forecast_accuracy_metric_set(),
        control   = control_grid(verbose = TRUE, save_pred = TRUE)
    )
toc()
69.998 sec elapsed
# ** Reset Sequential Plan ----

plan(strategy = sequential)
tune_results_nnetar_1
# Tuning results
# NA 
# A tibble: 6 × 5
  splits               id     .metrics          .notes   .predictions 
  <list>               <chr>  <list>            <list>   <list>       
1 <split [23511/1344]> Slice1 <tibble [90 × 8]> <tibble… <tibble [20,…
2 <split [22839/1344]> Slice2 <tibble [78 × 8]> <tibble… <tibble [17,…
3 <split [22167/1344]> Slice3 <tibble [90 × 8]> <tibble… <tibble [20,…
4 <split [21495/1344]> Slice4 <tibble [90 × 8]> <tibble… <tibble [20,…
5 <split [20823/1344]> Slice5 <tibble [90 × 8]> <tibble… <tibble [20,…
6 <split [20151/1344]> Slice6 <tibble [90 × 8]> <tibble… <tibble [20,…
tune_results_nnetar_1 %>% show_best(metric = "rmse", n = Inf)
# A tibble: 15 × 10
   non_seasonal_ar seasonal_ar hidden_units  penalty .metric
             <int>       <int>        <int>    <dbl> <chr>  
 1               3           1            1 8.01e-10 rmse   
 2               2           0            2 6.28e- 1 rmse   
 3               5           1            3 8.92e- 3 rmse   
 4               1           0            2 4.37e- 8 rmse   
 5               5           0            4 6.29e- 6 rmse   
 6               2           1            6 1.97e- 6 rmse   
 7               3           2            8 9.89e- 2 rmse   
 8               4           0            4 2.20e- 9 rmse   
 9               2           2            5 1.21e- 4 rmse   
10               4           1            7 4.43e-10 rmse   
11               3           2            9 7.53e- 5 rmse   
12               1           2            9 5.45e- 4 rmse   
13               0           1           10 2.35e- 2 rmse   
14               1           1            5 1.45e- 7 rmse   
15               0           1            8 6.40e- 7 rmse   
# … with 5 more variables: .estimator <chr>, mean <dbl>, n <int>,
#   std_err <dbl>, .config <chr>
g <- tune_results_nnetar_1 %>%
    autoplot() +
    geom_smooth(se = FALSE)
library(plotly)
ggplotly(g)
set.seed(123)
wflw_fit_nnetar_tscv <- wflw_fit_nnetar %>%
    finalize_workflow(
        tune_results_nnetar_1 %>% 
            show_best(metric = "rmse", n = Inf) %>%
            dplyr::slice(1)
    ) %>%
    fit(training(splits))

wflw_fit_nnetar_tscv
══ Workflow [trained] ════════════════════════════════════════════════
Preprocessor: Recipe
Model: nnetar_reg()

── Preprocessor ──────────────────────────────────────────────────────
0 Recipe Steps

── Model ─────────────────────────────────────────────────────────────
Series: outcome 
Model:  NNAR(3,1,1)[7] 
Call:   forecast::nnetar(y = outcome, p = p, P = P, size = size, repeats = repeats, 
    xreg = xreg_matrix, decay = decay, maxit = maxit)

Average of 10 networks, each of which is
a 4-1-1 network with 7 weights
options were - linear output units  decay=8.006317e-10

sigma^2 estimated as 841.5
pred <- predict(wflw_fit_nnetar_tscv, testing(splits))


testing(splits) %>% 
  bind_cols(pred) %>% 
  ggplot() + 
  geom_line(aes(x = date, y = value), color = "blue") + 
  geom_line(aes(x = date, y = .pred), color = "red") 

Citation

For attribution, please cite this work as

dondon (2022, March 2). DonDon: modeltime tune. Retrieved from https://dondon-blog.netlify.app/posts/2022-03-02-modeltime-tune/

BibTeX citation

@misc{dondon2022modeltime,
  author = {dondon, },
  title = {DonDon: modeltime tune},
  url = {https://dondon-blog.netlify.app/posts/2022-03-02-modeltime-tune/},
  year = {2022}
}